It Takes One to Tango but More Make Trouble? In-Context Training with Different Number of Demonstrations
Large language models (LLMs) are capable of performing complex reasoning via
in-context learning (ICL) when provided with a few input-output demonstrations
(demos), and become more powerful when the demos include intermediate reasoning
steps ("chain of thought" (CoT)). Are multiple demos necessary for ICL? In
this paper, we study ICL using fewer demos for each test query on the tasks
in~\cite{wei2022chain}. Surprisingly, we do not observe significant degradation
when using only one randomly chosen demo. To study this phenomenon, for each
test query, we categorize demos into "correct demos" leading to the correct
answer, and "wrong demos" resulting in wrong answers. Our analysis reveals an
inherent bias in those widely studied datasets: most demos are correct for a
majority of test queries, which explains the good performance of using one
random demo. Moreover, ICL (with or without CoT) using only one correct demo
significantly outperforms the all-demo ICL adopted by most previous works,
indicating a weakness of LLMs in finding correct demo(s) for input queries,
which is difficult to evaluate on the biased datasets. Furthermore, we observe
a counterintuitive behavior of multi-demo ICL: its accuracy
degrades (improves) when given more correct (wrong) demos. This implies that ICL
can be easily misguided by interference among demos and their spurious
correlations. Our analyses highlight several fundamental challenges that need
to be addressed in LLM training, ICL, and benchmark design.
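The demo-count manipulation studied above can be sketched as a simple prompt builder; the arithmetic demos and the `build_icl_prompt` helper below are hypothetical illustrations, not the paper's code:

```python
# A sketch of varying the number of in-context demos per test query, with or
# without CoT rationales. Demo data and helper names are hypothetical.

def build_icl_prompt(demos, query, use_cot=False):
    """Concatenate k input-output demos before the test query."""
    parts = []
    for d in demos:
        answer = f"{d['cot']} The answer is {d['answer']}." if use_cot else d["answer"]
        parts.append(f"Q: {d['question']}\nA: {answer}")
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

demos = [
    {"question": "2 + 3 = ?", "cot": "2 plus 3 equals 5.", "answer": "5"},
    {"question": "4 + 4 = ?", "cot": "4 plus 4 equals 8.", "answer": "8"},
]

one_demo = build_icl_prompt(demos[:1], "7 + 6 = ?")            # one random demo
all_demo = build_icl_prompt(demos, "7 + 6 = ?", use_cot=True)  # all demos + CoT
```

The one-demo and all-demo prompts differ only in how many demo blocks precede the query, which is exactly the variable the study manipulates.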
When do you need Chain-of-Thought Prompting for ChatGPT?
Chain-of-Thought (CoT) prompting can effectively elicit complex multi-step
reasoning from Large Language Models~(LLMs). For example, by simply adding the
CoT instruction ``Let's think step-by-step'' to each input query of the
MultiArith dataset, GPT-3's accuracy improves from 17.7\% to 78.7\%. However, it is
not clear whether CoT is still effective on more recent instruction finetuned
(IFT) LLMs such as ChatGPT. Surprisingly, on ChatGPT, CoT is no longer
effective for certain tasks such as arithmetic reasoning while remaining
effective on other reasoning tasks. Moreover, on the former tasks, ChatGPT
usually achieves the best performance and can generate CoT even without being
instructed to do so. Hence, it is plausible that ChatGPT has already been
trained on these tasks with CoT and thus has memorized the instruction,
implicitly following it when applied to the same queries even without the CoT
prompt. Our analysis reflects a potential risk of overfitting/bias toward
instructions introduced in IFT, which is becoming more common in LLM training.
In addition, it indicates possible leakage of the pretraining recipe, e.g., one
can verify whether a dataset and instruction were used in training ChatGPT. Our
experiments report new baseline results of ChatGPT on a variety of reasoning
tasks and shed new light on LLM profiling, instruction memorization, and
pretraining dataset leakage.
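The trigger-phrase setup described above can be sketched in a few lines; `make_prompt` is a hypothetical helper, not the paper's evaluation code:

```python
# A minimal sketch of zero-shot CoT prompting: the CoT instruction
# "Let's think step-by-step" is appended to each query before it is sent
# to the LLM.

COT_TRIGGER = "Let's think step-by-step."

def make_prompt(query, use_cot):
    """Return the query as-is, or with the CoT trigger appended."""
    return f"{query}\n{COT_TRIGGER}" if use_cot else query

query = "A pencil costs $2. How much do 5 pencils cost?"
standard = make_prompt(query, use_cot=False)  # plain zero-shot prompt
cot = make_prompt(query, use_cot=True)        # zero-shot CoT prompt
```

Comparing model outputs on the two prompt variants is the core experimental contrast the study runs on ChatGPT versus GPT-3.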
InstructZero: Efficient Instruction Optimization for Black-Box Large Language Models
Large language models~(LLMs) are instruction followers, but it can be
challenging to find the best instruction for different situations, especially
for black-box LLMs on which backpropagation is forbidden. Instead of directly
optimizing the discrete instruction, we optimize a low-dimensional soft prompt
applied to an open-source LLM to generate the instruction for the black-box
LLM. On each iteration of the proposed method, which we call InstructZero, a
soft prompt is converted into an instruction using the open-source LLM, which
is then submitted to the black-box LLM for zero-shot evaluation, and the
performance is fed back to Bayesian optimization, which produces new soft
prompts that improve zero-shot performance. We evaluate InstructZero on different
combinations of open-source LLMs and APIs including Vicuna and ChatGPT. Our
results show that InstructZero outperforms SOTA auto-instruction methods across
a variety of downstream tasks. Our code and data are publicly available at
https://github.com/Lichang-Chen/InstructZero. Comment: 15 pages; 9 figures; our
code is available at https://lichang-chen.github.io/InstructZero
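The optimization loop described above can be sketched as follows. This is a simplified illustration: `soft_prompt_to_instruction` and `zero_shot_score` are hypothetical placeholders for the open-source LLM decoder and the black-box LLM evaluation, and plain random search stands in for the Bayesian optimization used in the actual method:

```python
import random

# A simplified sketch of the InstructZero loop (random search replaces the
# Bayesian optimization; both LLM calls are placeholders).

DIM = 10  # dimensionality of the low-dimensional soft prompt

def soft_prompt_to_instruction(soft_prompt):
    # Placeholder: a real system feeds the soft prompt through an open-source
    # LLM (e.g. Vicuna) and decodes a natural-language instruction.
    return "Instruction derived from " + ", ".join(f"{x:.2f}" for x in soft_prompt)

def zero_shot_score(instruction):
    # Placeholder: a real system submits the instruction to the black-box LLM
    # (e.g. ChatGPT) and measures zero-shot accuracy on the task.
    return random.random()

def instructzero_loop(iterations=20, seed=0):
    rng = random.Random(seed)
    best_score, best_instruction = -1.0, None
    for _ in range(iterations):
        # Bayesian optimization would propose this point from a surrogate model
        # fit on past (soft prompt, score) pairs; here we sample it uniformly.
        soft_prompt = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
        instruction = soft_prompt_to_instruction(soft_prompt)
        score = zero_shot_score(instruction)
        if score > best_score:
            best_score, best_instruction = score, instruction
    return best_instruction, best_score

best_instruction, best_score = instructzero_loop()
```

The design point to note is that only the low-dimensional soft prompt is optimized; the discrete instruction is never searched over directly.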
Physicochemical Attributes of Nano-scale V2O5/TiO2 Catalysts and Their Effect on Soot Oxidation
V2O5 catalysts supported on nano-scale TiO2 with varying vanadium contents (5%, 10%, 20%, and 40%) were prepared by an incipient-wetness impregnation method. The phase structures of the nano-scale V2O5/TiO2 catalysts at different loading rates were characterized by scanning electron microscopy (SEM), X-ray diffraction (XRD), and Fourier-transform infrared (FT-IR) spectroscopy. The oxidation activities of the catalysts over diesel soot were measured in a thermogravimetric analysis (TGA) system, and the kinetics of the catalytic oxidation process were analyzed with the Flynn-Wall-Ozawa method. The characterization results show that the phase structure of V2O5 supported on TiO2 depends heavily on the vanadium content, which strongly affects the catalytic performance for soot oxidation. At low vanadium loading rates (V5-V20), the active species exist as monomers and polymeric states; at a high loading rate (V40), crystalline bulk V2O5 covers the TiO2 surface, and the resulting crystal structure occupies the active sites and decreases the catalytic effect. Comparing the characteristic temperatures of soot oxidation over the V2O5 catalysts, the catalytic activities at different loading rates rank as V5 < V10 < V40 < V20. Pyrolysis kinetics analysis reveals that the activation energy of soot oxidation is lowest at a vanadium loading rate of 20%, which fits well with the TG experimental results. The consistency of the pyrolysis kinetics and TG experiments confirms that V20, the catalyst nearest to the monolayer-dispersion saturation state of V2O5/TiO2, is the most active catalyst among those studied, and convincingly demonstrates a clear threshold effect in V2O5 catalysts.
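For reference, the Flynn-Wall-Ozawa isoconversional relation used in the kinetics analysis above is commonly written (via Doyle's approximation) as:

```latex
% Flynn-Wall-Ozawa isoconversional relation (Doyle's approximation):
% \beta = heating rate, E_a = activation energy, A = pre-exponential factor,
% g(\alpha) = integral form of the conversion function, R = gas constant,
% T = absolute temperature.
\log \beta = \log\!\left(\frac{A E_a}{R\, g(\alpha)}\right) - 2.315 - 0.4567\,\frac{E_a}{R T}
```

Plotting log(beta) against 1/T at a fixed conversion gives a straight line whose slope yields the activation energy, which is how the per-loading E_a values are compared.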
Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning
Recent advancements in Large Language Models (LLMs) have expanded the
horizons of natural language understanding and generation. Notably, the output
control and alignment with the input of LLMs can be refined through instruction
tuning. However, as highlighted in several studies, low-quality data in the
training set are usually detrimental to instruction tuning, resulting in
inconsistent or even misleading LLM outputs. We propose a novel method, termed
"reflection-tuning," which addresses the problem by leveraging the self-improvement
and judging capabilities of LLMs. This approach utilizes an oracle LLM to recycle
the original training data by introspecting and enhancing the quality of
instructions and responses in the data. Extensive experiments on widely used
evaluation benchmarks show that LLMs trained with our recycled data outperform
those trained with the original datasets.
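The recycling step described above can be sketched as follows; `oracle_improve` is a hypothetical stand-in for the oracle LLM's introspect-and-rewrite call:

```python
# A sketch of the data-recycling step in reflection-tuning: every
# (instruction, response) pair in the training set is rewritten by an oracle.

def oracle_improve(instruction, response):
    # Placeholder: a real oracle LLM would critique the pair against quality
    # criteria (clarity, correctness, depth) and return an improved version.
    return instruction.strip().capitalize(), response.strip()

def recycle_dataset(pairs):
    """Rewrite every (instruction, response) pair with the oracle."""
    return [oracle_improve(inst, resp) for inst, resp in pairs]

raw = [("  explain gravity  ", "  Gravity pulls objects together.  ")]
recycled = recycle_dataset(raw)
```

The recycled pairs, not the raw ones, then become the instruction-tuning data.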
Unbiased Watermark for Large Language Models
The recent advancements in large language models (LLMs) have sparked a
growing apprehension regarding their potential misuse. One approach to mitigating
this risk is to incorporate watermarking techniques into LLMs, allowing for the
tracking and attribution of model outputs. This study examines a crucial aspect
of watermarking: how significantly watermarks impact the quality of
model-generated outputs. Previous studies have suggested a trade-off between
watermark strength and output quality. However, our research demonstrates that,
with an appropriate implementation, it is possible to integrate watermarks
without affecting the output probability distribution. We refer to this type of
watermark as an unbiased watermark. This has significant implications for the
use of LLMs, as it becomes impossible for users to discern whether a service
provider has incorporated watermarks or not. Furthermore, the presence of
watermarks does not compromise the performance of the model in downstream
tasks, ensuring that the overall utility of the language model is preserved.
Our findings contribute to the ongoing discussion around responsible AI
development, suggesting that unbiased watermarks can serve as an effective
means of tracking and attributing model outputs without sacrificing output
quality.
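One way to realize a distribution-preserving watermark (a sketch, not necessarily the paper's construction) is the keyed Gumbel-max trick: adding i.i.d. Gumbel noise to the log-probabilities and taking the argmax is distributionally identical to sampling from the original distribution, so when the noise is derived from a secret key the output distribution is unchanged in expectation over keys while remaining detectable to the key holder. The key-derivation scheme below is a hypothetical simplification:

```python
import hashlib
import math

# Sketch of a keyed Gumbel-max watermark. Sampling argmax_i(log p_i + g_i)
# with i.i.d. Gumbel g_i reproduces the distribution p exactly; deriving g_i
# from a secret key makes the choice reproducible by the watermark detector.

def keyed_gumbel(key, position, token_id):
    """Pseudorandom Gumbel variate derived from (secret key, position, token)."""
    digest = hashlib.sha256(f"{key}:{position}:{token_id}".encode()).digest()
    u = max(int.from_bytes(digest[:8], "big") / 2**64, 1e-12)  # uniform in (0, 1)
    return -math.log(-math.log(u))

def watermarked_sample(probs, key, position):
    """Sample a token id from `probs` via the keyed Gumbel-max trick."""
    return max(range(len(probs)),
               key=lambda i: math.log(probs[i]) + keyed_gumbel(key, position, i))

token = watermarked_sample([0.1, 0.6, 0.3], key="secret", position=0)
```

Given the same key and position, the sample is deterministic, which is what a detector exploits; without the key, outputs are indistinguishable from ordinary sampling.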
Task-Aware Sampling Layer for Point-Wise Analysis
Sampling, grouping, and aggregation are three important components in the
multi-scale analysis of point clouds. In this paper, we present a novel
data-driven sampler learning strategy for point-wise analysis tasks. Unlike the
widely used sampling technique, Farthest Point Sampling (FPS), we propose to
learn sampling and downstream applications jointly. Our key insight is that
uniform sampling methods like FPS are not always optimal for different tasks:
for example, sampling more points around boundary areas can make point-wise
classification easier in segmentation. Towards this end, we propose a novel
sampler learning strategy that learns sampling point displacement supervised by
task-related ground truth information and can be trained jointly with the
underlying tasks. We further demonstrate our methods in various point-wise
analysis tasks, including semantic part segmentation, point cloud completion,
and keypoint detection. Our experiments show that jointly learning the
sampler and the task brings better performance than using FPS in various
point-based networks. Comment: 14 pages, 13 figures, and 14 tables
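For contrast with the learned sampler, the FPS baseline mentioned above can be sketched in a few lines; this is a generic illustration, not the paper's implementation:

```python
# Minimal farthest point sampling (FPS): greedily pick k points, each one
# maximizing its distance to the points already chosen. The task-aware
# sampler would instead learn displacements on top of such seed points.

def farthest_point_sampling(points, k):
    """Return indices of k points chosen by greedy farthest-point selection."""
    chosen = [0]  # start from the first point (often randomized in practice)
    dist = [float("inf")] * len(points)  # squared distance to nearest chosen point
    for _ in range(k - 1):
        last = points[chosen[-1]]
        for i, p in enumerate(points):
            d = sum((a - b) ** 2 for a, b in zip(p, last))
            dist[i] = min(dist[i], d)
        chosen.append(max(range(len(points)), key=lambda i: dist[i]))
    return chosen

cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
idx = farthest_point_sampling(cloud, 2)  # picks well-separated points
```

FPS spreads samples uniformly over the cloud regardless of the downstream task, which is exactly the limitation the learned, task-supervised displacement is meant to address.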
AlpaCare: Instruction-tuned Large Language Models for Medical Application
Large Language Models (LLMs) have demonstrated significant enhancements in
instruction-following abilities through instruction tuning, achieving notable
performances across various tasks. Previous research has focused on fine-tuning
medical domain-specific LLMs using an extensive array of medical-specific data,
incorporating millions of pieces of biomedical literature to augment their
medical capabilities. However, existing medical instruction-tuned LLMs have
been constrained by the limited scope of tasks and instructions available,
restricting the efficacy of instruction tuning and adversely affecting
performance in the general domain. In this paper, we fine-tune LLaMA-series
models on MedInstruct-52k, a diverse set of 52k machine-generated medical
instruction-following examples, resulting in the model AlpaCare. Comprehensive
experimental results on both general and medical-specific domain free-form
instruction evaluations showcase AlpaCare's strong medical proficiency and
generalizability compared to previous instruction-tuned models in both medical
and general domains. We provide public access to our MedInstruct-52k dataset
and a clinician-crafted free-form instruction test set, MedInstruct-test, along
with our codebase, to foster further research and development. Our project page
is available at https://github.com/XZhang97666/AlpaCare
GPT-4 Vision on Medical Image Classification -- A Case Study on COVID-19 Dataset
This technical report delves into the application of GPT-4 Vision (GPT-4V) in
the nuanced realm of COVID-19 image classification, leveraging the
transformative potential of in-context learning to enhance diagnostic
processes.